12 research outputs found

    Results of the WMT19 metrics shared task: segment-level and strong MT systems pose big challenges

    This paper presents the results of the WMT19 Metrics Shared Task. Participants were asked to score the outputs of the translation systems competing in the WMT19 News Translation Task with automatic metrics. 13 research groups submitted 24 metrics, 10 of which are reference-less "metrics" and constitute submissions to the joint task with the WMT19 Quality Estimation Task, "QE as a Metric". In addition, we computed 11 baseline metrics, with 8 commonly applied baselines (BLEU, SentBLEU, NIST, WER, PER, TER, CDER, and chrF) and 3 reimplementations (chrF+, sacreBLEU-BLEU, and sacreBLEU-chrF). Metrics were evaluated at the system level (how well a given metric correlates with the WMT19 official manual ranking) and at the segment level (how well the metric correlates with human judgements of segment quality). This year, we use direct assessment (DA) as our only form of manual evaluation.
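As a rough illustration of system-level evaluation, the sketch below correlates one metric's per-system scores with human scores. The abstract does not name the correlation coefficient, so the use of Pearson's r here is an assumption, and the function name and all numbers are hypothetical toy values, not WMT19 data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: one automatic-metric score and one human
# (DA-style) score per MT system. Not taken from the shared task.
metric_scores = [27.1, 31.4, 29.8, 35.2]
human_scores = [-0.12, 0.05, 0.01, 0.21]

r = pearson(metric_scores, human_scores)
print(f"system-level correlation: {r:.3f}")
```

A value of r close to 1 would mean the metric orders and spaces the systems much as the human scores do. Segment-level evaluation would apply the same idea to individual translations rather than to whole systems.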

    Large expert-curated database for benchmarking document similarity detection in biomedical literature search

    Document recommendation systems for locating relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents that covers a variety of research fields, such that newly developed literature search techniques can be compared, improved and translated into practice. To overcome this bottleneck, we have established the RElevant LIterature SearcH consortium consisting of more than 1500 scientists from 84 countries, who have collectively annotated the relevance of over 180 000 PubMed-listed articles with regard to their respective seed (input) article/s. The majority of annotations were contributed by highly experienced, original authors of the seed articles. The collected data cover 76% of all unique PubMed Medical Subject Headings descriptors. No systematic biases were observed across different experience levels, research fields or time spent on annotations. More importantly, annotations of the same document pairs contributed by different scientists were highly concordant. We further show that the three representative baseline methods used to generate recommended articles for evaluation (Okapi Best Matching 25, Term Frequency-Inverse Document Frequency and PubMed Related Articles) had similar overall performances. Additionally, we found that these methods each tend to produce distinct collections of recommended articles, suggesting that a hybrid method may be required to completely capture all relevant articles. The established database server located at https://relishdb.ict.griffith.edu.au is freely available for the downloading of annotation data and the blind testing of new methods. We expect that this benchmark will be useful for stimulating the development of new powerful techniques for title and title/abstract-based search engines for relevant articles in biomedical research. Peer reviewed.

    Operationalizing content moderation "accuracy" in the Digital Services Act

    The Digital Services Act, recently adopted by the EU, requires social media platforms to report the "accuracy" of their automated content moderation systems. The colloquial term is vague, or open-textured: on what data and ground truth labels are we measuring accuracy? In addition, the literal accuracy (number of correct predictions divided by the total) is not suitable for problems with large class imbalance. Without more detailed specification, the regulatory requirement allows for deficient or even malicious reporting from companies. In this interdisciplinary work, we operationalize "accuracy" reporting by both refining legal concepts and providing technical implementation. After addressing underspecification, we propose more appropriate accuracy measures by relating problems of overmoderation and undermoderation to low precision and low recall, respectively. While estimating precision is statistically straightforward, estimating recall poses a challenging statistical problem. Naive estimation can incur extremely high annotation costs, which would be unduly burdensome and interfere disproportionately with the platform's right to conduct business. We introduce stratified sampling with trained classifiers, and we show it greatly improves the efficiency of recall estimation compared with random sampling on the CivilComments dataset. From this, we provide concrete recommendations in applying stratified sampling to improve efficiency. Using this improved estimator, we study moderation recall of personal attacks on different subreddits, and calculate realistic statistics to demonstrate such a reporting requirement in practice. We conclude by relating our legal and technical analysis to guidelines for future legal clarification and implementation. Comment: In submission to ICWSM 202
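The abstract's point about class imbalance can be made concrete with a small sketch. It computes literal accuracy, precision and recall for two hypothetical moderation classifiers on toy labels; the function names and all counts are illustrative assumptions, not figures or code from the paper, and the paper's stratified-sampling recall estimator is not reproduced here.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = violating content)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def report(y_true, y_pred):
    """Return literal accuracy, precision and recall (None if undefined)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        # Low precision corresponds to overmoderation.
        "precision": tp / (tp + fp) if (tp + fp) else None,
        # Low recall corresponds to undermoderation.
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }


# Hypothetical toy data: 1000 posts, of which only 10 are violating.
y_true = [1] * 10 + [0] * 990

# Classifier A never flags anything.
flags_nothing = [0] * 1000
# Classifier B catches 8 of the 10 violating posts and wrongly flags 12 others.
flags_some = [1] * 8 + [0] * 2 + [1] * 12 + [0] * 978

print(report(y_true, flags_nothing))
print(report(y_true, flags_some))
```

On this toy data, classifier A reaches 0.99 accuracy while catching nothing (recall 0.0, precision undefined), whereas classifier B scores a slightly lower 0.986 accuracy with precision 0.4 and recall 0.8. That gap is why reporting precision and recall separately says more about overmoderation and undermoderation than a single accuracy figure.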

    Risk for Major Bleeding in Patients Receiving Ticagrelor Compared With Aspirin After Transient Ischemic Attack or Acute Ischemic Stroke in the SOCRATES Study (Acute Stroke or Transient Ischemic Attack Treated With Aspirin or Ticagrelor and Patient Outcomes)

    Prospective observational cohort study on grading the severity of postoperative complications in global surgery research

    Background: The Clavien–Dindo classification is perhaps the most widely used approach for reporting postoperative complications in clinical trials. This system classifies complication severity by the treatment provided. However, it is unclear whether the Clavien–Dindo system can be used internationally in studies across differing healthcare systems in high- (HICs) and low- and middle-income countries (LMICs). Methods: This was a secondary analysis of the International Surgical Outcomes Study (ISOS), a prospective observational cohort study of elective surgery in adults. Data collection occurred over a 7-day period. Severity of complications was graded using Clavien–Dindo and the simpler ISOS grading (mild, moderate or severe, based on guided investigator judgement). Severity grading was compared using the intraclass correlation coefficient (ICC). Data are presented as frequencies and ICC values (with 95 per cent c.i.). The analysis was stratified by income status of the country, comparing HICs with LMICs. Results: A total of 44 814 patients were recruited from 474 hospitals in 27 countries (19 HICs and 8 LMICs). Some 7508 patients (16·8 per cent) experienced at least one postoperative complication, equivalent to 11 664 complications in total. Using the ISOS classification, 5504 of 11 664 complications (47·2 per cent) were graded as mild, 4244 (36·4 per cent) as moderate and 1916 (16·4 per cent) as severe. Using Clavien–Dindo, 6781 of 11 664 complications (58·1 per cent) were graded as I or II, 1740 (14·9 per cent) as III, 2408 (20·6 per cent) as IV and 735 (6·3 per cent) as V. Agreement between classification systems was poor overall (ICC 0·41, 95 per cent c.i. 0·20 to 0·55), and in LMICs (ICC 0·23, 0·05 to 0·38) and HICs (ICC 0·46, 0·25 to 0·59). Conclusion: Caution is recommended when using a treatment approach to grade complications in global surgery studies, as this may introduce bias unintentionally.

    The surgical safety checklist and patient outcomes after surgery: a prospective observational cohort study, systematic review and meta-analysis

    © 2017 British Journal of Anaesthesia. Background: The surgical safety checklist is widely used to improve the quality of perioperative care. However, clinicians continue to debate the clinical effectiveness of this tool. Methods: Prospective analysis of data from the International Surgical Outcomes Study (ISOS), an international observational study of elective in-patient surgery, accompanied by a systematic review and meta-analysis of published literature. The exposure was surgical safety checklist use. The primary outcome was in-hospital mortality and the secondary outcome was postoperative complications. In the ISOS cohort, a multivariable multi-level generalized linear model was used to test associations. To further contextualise these findings, we included the results from the ISOS cohort in a meta-analysis. Results are reported as odds ratios (OR) with 95% confidence intervals. Results: We included 44 814 patients from 497 hospitals in 27 countries in the ISOS analysis. There were 40 245 (89.8%) patients exposed to the checklist, whilst 7508 (16.8%) sustained ≥1 postoperative complications and 207 (0.5%) died before hospital discharge. Checklist exposure was associated with reduced mortality [odds ratio (OR) 0.49 (0.32–0.77); P<0.01], but no difference in complication rates [OR 1.02 (0.88–1.19); P=0.75]. In a systematic review, we screened 3732 records and identified 11 eligible studies of 453 292 patients including the ISOS cohort. Checklist exposure was associated with both reduced postoperative mortality [OR 0.75 (0.62–0.92); P<0.01; I²=87%] and reduced complication rates [OR 0.73 (0.61–0.88); P<0.01; I²=89%]. Conclusions: Patients exposed to a surgical safety checklist experience better postoperative outcomes, but this could simply reflect wider quality of care in hospitals where checklist use is routine.